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Abstract 

Background: Complex traits may be defined by a range of different criteria. It would result in a loss of information 
to perform analyses simply on the basis of a final clinical dichotomized affected / unaffected variable. 

Results: We assess the performance of four alternative approaches for the analysis of multiple phenotypes in 
genetic association studies. We describe the four methods in detail and discuss their relative theoretical merits and 
disadvantages. Using simulation we demonstrate that PCA provides the greatest power when applied to both 
correlated phenotypes and with large numbers of phenotypes. The multivariate approach had low type I error only 
with independent phenotypes or small numbers of phenotypes. In this study, our application of the four methods 
to schizophrenia data provides converging evidence of the relative performance of the methods. 

Conclusions: Via power analysis of simulated data and testing of experimental data, we conclude that PCA, 
creating one variable based on a linear combination of all the traits, performs optimally. We propose that our 
comparison will provide insight into the properties of the methods and help researchers to choose appropriate 
strategy in future experimental studies. 
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Background 

For linkage and genetic association studies of biological 
markers, a complex trait can be defined by a range of 
multiple and often overlapping criteria. For example, 
obesity, usually defined by body mass index (BMI), is 
also related to waist-hip ratio (WHR) and body fat per- 
centage. More than one indicator is usually used. It is 
possible that the specific type of the indicator selected 
may favor one susceptibility gene, while selection on an- 
other indicator may reveal another gene. In early 
genome-wide association studies, a common variant of 
the FTO gene was implicated in increased BMI and to 
predispose to childhood and adult obesity [1]. Later, a 
meta-analysis of 61 studies concluded that multiple loci 
affect WHR independently of BMI [2]. In this example, 
WHR and BMI may reflect different aspects of the 
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underlying gene effect, demonstrating the importance of 
utilizing multiple phenotype data in the analysis, al- 
though chance fluctuation may also lead to the different 
results of BMI and WHR. 

Multiple intermediate phenotypes have been proposed 
for a variety of neuropsychiatric disorders, in particular 
schizophrenia, bipolar disorder, and Alzheimer's disease. 
For example in Alzheimer's disease, impairment occurs 
in eight cognitive domains including attention, language, 
memory, perceptual skills, constructive abilities, orienta- 
tion, problem solving and functional abilities [3]. Typically 
the resulting measures are statistically or functionally 
correlated, which then increases the difficulty of handling 
such multivariate data. So when subjects clinically are 
diagnosed as either affected or unaffected for a disorder, 
this dichotomization may lead to a loss of power in 
genetic analyses. 

In a recent review, Ott et al [4] described four ap- 
proaches to tackle multiple phenotypes. The first is the 
most general and proposes to analyze each phenotype 
individually and correct for multiple testing by the 
Bonferroni method. The second is similar to the first but 
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proposes permutation analysis to address problems of 
multiple testing. The third is to treat different pheno- 
types as a multivariate regression problem. The last is to 
transform all phenotype data into a single overall pheno- 
type using principal component analysis (PCA) and then 
to perform standard univariate regression at each bio- 
marker. There is so far no consensus on which method 
is the best. 

The primary purpose of our study was to assess the 
performance of these four approaches. We introduce the 
four methods in detail and discuss their relative advan- 
tages and disadvantages. Then through power analysis of 
extensively simulated data and a real data application, 
we conclude that for genetic association studies, using 
PCA to create one variable based on a linear combin- 
ation of all the traits performs optimally. 

Methods 

One at a time 

The most intuitive and simplest way to deal with multiple 
phenotypes is to test each SNP against one phenotype at a 
time. In the case of quantitative traits, a one-way analysis 
of variance (ANOVA) is usually performed. It tests 
whether the mean of a phenotype is the same in the three 
genotypes, AA, AB, and BB. As an alternative to ANOVA, 
we can perform a simple linear regression for each pheno- 
type as a response variable and the genotypes as predictors 
(this analysis has 1 degree of freedom [df] versus 2 for 
ANOVA). Each of the phenotypes would measure a trait 
from a particular different angle. A given SNP could be as- 
sociated with none, one or more of the phenotypes. In 
practice, researchers have to make decisions on the criteria 
for declaring significance and interpretation. Usually, if any 
of the phenotypes result in a statistically significant out- 
come, the SNP is retained for further investigation. This is 
also what we do in this study, that is, if the smallest ^-value 
across phenotypes is less than a pre-determined threshold, 
we suspect the SNP is a genetic risk factor for the trait. 

Since multiple tests are conducted at a SNP, we need 
to handle the resulting ^-values by controlling the over- 
all type I error. In the situation of a single test, a result 
is declared significant when p < 0.05 if the type I error a 
is controlled at 5% by convention. With m independent 
tests, the probability of making correct decisions on all 
the results is (1 - p) m , given the null hypothesis is true. 
So the probability of finding at least one false positive is 
1 - (1 - p) m . This overall type I error a is called the 
family-wise error rate. It is approximated by mp when m is 
large and p is small and we want to keep a < 0.05 [5]. Thus, 
we should set 0.05/m as the threshold to declare a single 
test significant. This approach represents the well-known 
Bonferroni correction. Unfortunately, the disadvantage is 
that the correction tends to be too stringent when tests are 
dependent, which is often the case with endophenotypes. 



This correction causes more false negatives so power is de- 
creased. An alternative to Bonferroni correction is to use 
permuted p-values as discussed in the next section. 

Permuted p-values 

For a given SNP, we define the best test statistic, F max or 
P min , among associations with all phenotypes as our final 
test statistic for this SNP (P m i n is preferable in the pres- 
ence of a mixture of categorical and quantitative pheno- 
types). In the first approach of one SNP at a time, we 
assess the significance level associated with a test statis- 
tic by looking it up in a known statistical distribution 
table. Since the null distribution of P min may not have a 
known distribution, we approximate it by simulating 
datasets under no association. The procedures are firstly 
to permute sample labels in phenotypes but keeping ge- 
notypes in their original order. Obviously, in this new 
dataset we destroy any association between phenotypes 
and genotypes by randomization. Secondly, we obtain 
and store the smallest ^-value in such a dataset. We re- 
peat the randomization a sufficiently large number of 
times. The smallest p-values stored would approximate 
the distribution of P min under the null hypothesis. 
Finally, we calculate the proportion of the smallest 
p-values in the distribution less than or equal to the 
observed P min to be the significance level associated with 
Anin [6]. It is often believed that a permuted j^-value is 
not as conservative as a Bonferroni-corrected p-value 
and, thus, is more powerful. We will revisit this issue in 
the simulation section. 

Multivariate analysis 

The obvious drawback to the first two approaches is that 
they do not utilize information from the structure of 
multiple phenotypes, which may or may not to be corre- 
lated. Given there are not too many phenotypes, we 
could carry out regression or multivariate analysis of 
variance (MANOVA) with multiple phenotypes directly. 
In multivariate regression, the response variables are as- 
sumed to follow some specific multivariate distribution, 
most commonly a multi-normal distribution, although 
this is a strong and sometimes unwarranted assumption. 

Principal component analysis 

Another method of analyzing several phenotypes simul- 
taneously is to summarize them into one overall value. 
The simplest summary statistic is to take the mean or 
sum (often called a "scale"). But in real-life examples, dir- 
ectly adding phenotype values does not always make 
sense. For example, it is difficult to interpret one's height 
plus weight plus intelligence quotient (IQ). Instead of a 
simple sum, we may consider a weighted linear combin- 
ation of phenotypes with weights based on the inverse of 
their variances. In principal component analysis (PCA), 
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such a weighted sum is called the first principal compo- 
nent (PC). This technique of dimension reduction is 
often used in the presence of a large number of predic- 
tors in a regression model Here, we apply the technique 
to condense information in outcome variables, that is, 
phenotypes. 

Software 

All the simulation and data analysis described in this 
paper was conducted in the R statistical programming 
language (http://www.r-project.org). 

Dataset and preprocessing procedures 

We first describe our analysis of a dataset on schizophrenia, 
followed by extensive simulation of computer-generated 
data. Investigators collected clinical, cognitive, MRI and 
genetic information of European subjects in families with 
schizophrenia and controls. Family members include 
mainly parents and siblings. All case subjects are individuals 
with a diagnosis of schizophrenia. Additional description of 
the studies, including methods for selecting control subjects 
and diagnosis is provided in Toulopoulou et aU (2003; 
2003S; 2004; 2007; 2010) [7-11] and Owens et aU [12] and 
in the supplementary material (see Additional file 1). 

Results 

Real data analysis 

We select all available 216 unrelated schizophrenia pa- 
tients and 240 unrelated controls and analyze genotypes 



in case and control subjects for each of 51 candidate 
SNPs potentially associated with schizophrenia. Pheno- 
types of interest are subjects' cognition level measured 
from the angle of IQ and memory. More explicitly, they 
are two summary scores from the Wechsler Adult 
Intelligence Scale (WAIS) test to measure intelligence 
through verbal and performance subtests, one score 
from the National Adult Reading Test (NART) estimat- 
ing premorbid intelligence levels, and two scores from 
the Wechsler Memory Scale (WMS) test that measures 
logical memory in the form of immediate memory cap- 
acity and delayed memory performance. We apply the 
four approaches described in the methodology section to 
conduct two sets of analyses: one with all five pheno- 
types and the other with the three IQ phenotypes only. 

Results are given in Table 1, where we see that the 
three IQ and two memory variables are correlated with 
an average correlation coefficient of 0.59. Complete 
phenotype data are available for a total of 118 case sub- 
jects. Rates of missing genotypes ranged from 21% to 
56%. Out of 51 SNPs, rs3738401 is ranked highest 
(smallest j^-value) regardless of which method is used. It 
has previously been identified to be associated with 
schizophrenia in the Scottish population with a relative 
risk of 5 [13]. Note that in our data, this SNP is associated 
with the five phenotypes (three IQ and two memory vari- 
ables) only in patients and not in controls. Among the four 
methods, PCA results in the smallest ^-value of 0.002. The 
one-by-one and permutation approaches have similar 



Table 1 Top five SNPs discovered using the four methods 
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*Bonferroni correction is applied to correct for multiple testing among phenotypes. The exact formula used is 1 - (1 - p) m . 

There are two sets of analyses, one with five moderately correlated phenotypes and the other with three highly correlated phenotypes. 
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p-values of 0.008 and 0.009, respectively, followed by 
MANOVA with a Rvalue of 0.031. 

When we reduce the phenotypes to a highly correlated 
subset, the three IQ scores have a correlation coefficient 
of 0.72. We notice that MANOVA is then no longer able 
to pick up the potential risk variant, rs3738401, at a 
significance level of 0.05. The significance of this SNP in 
the other three methods also drops, possibly due to 
higher correlation, or fewer phenotypes, or both, 
although the sample size is slightly increased. We will 
investigate the relationship between power and correl- 
ation coefficient, number of phenotypes and so on in the 
next section. 



Simulation 

We simulate extensively to assess which of the ap- 
proaches would give highest power under various set- 
tings. We assume a multivariate, normally distributed 
phenotype associated with a risk SNP. Frequency of al- 
lele A is set to 0.3 throughout the simulation study. 
Other parameters in the simulation are the number of 
individuals n, the number of phenotypes m, effect size £, 
and the correlation coefficient r among the phenotypes. 
Here, 8 means that if we set the mean of phenotypes for 
individuals with genotype BB to be unit 1, mean of phe- 
notypes for individuals with genotype AB and AA would 
be equal to 8 and 8 2 , respectively. This inheritance 
model is analogous to the genotypic relative risk model, 
in which the chance of an individual having the disease 
increases by a factor 8 with an increasing copy of the 
risk allele in the genotypes [14]. We vary values of these 
parameters to investigate the pattern of power for the 
four approaches. 

Firstly, we examine the relationship between power 
and correlation among the phenotypes. While the pa- 
rameters n, m and 8 are fixed at 300, 10 and 1.2, respect- 
ively, r varies from 0 to 0.9, that is, from complete 
independence to a high correlation. We want to generate 
10,000 replicates of the SNP of interest, that is, 20,000 of 
its alleles. To generate a genotype, we assume Hardy 
Weinberg equilibrium (HWE) so that the two alleles in a 
genotype can be ascertained independently. To achieve 
this, allele A is generated from a binomial distribution 
(10,000, 0.3), and separately, allele B from a binomial 
distribution (10,000, 0.7). Under the assumption of 
HWE, given an allele frequency of 0.3, the mean geno- 
type frequencies of AA, AB and BB are 0.09, 0.42 and 
0.49, respectively. Phenotype data are then generated 
from a multivariate normal distribution with mean 
values of 8 2 , 8 and 1. 

In Figure 1, power is plotted against the correlation 
coefficient under the alternative hypothesis of 8* 1. The 
patterns for the change of power are quite different for 
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Figure 1 Power versus correlation coefficient. Phenotypes are 

generated from multivariate normal distribution. Correlation 

coefficient ranges from 0 to 1. 
k ) 

the four approaches. When phenotypes are correlated, 
even to a mild degree, MANOVA performs the worst. 
Its power decreases dramatically with an increase of the 
correlation coefficient. Performance of the one-by-one 
and permuted j^-value approaches also has an inverse re- 
lationship with the correlation coefficient, although their 
rates of decrease are slower. The PCA approach has the 
most interesting pattern, which does not show as a 
monotonic curve. Power first increases, then decreases. 
To take a closer look at where the peak occurs, we simu- 
late another set of SNPs with all parameter settings the 
same as above except the correlation coefficient varies 
from 0 to 0.2. Figure 2 reveals the maximum power to 
occur when the correlation coefficient is approximately 
0.05. We are not sure how to explain this unusual 
pattern. 

Next, we check the relationship between power and 
number of phenotypes, as shown in Figure 3, where r is 
fixed at 0.5. Interestingly, for all methods except 
MANOVA, power increases with more phenotypes in- 
cluded. The performance of the one-by-one and per- 
muted p-value approaches goes hand in hand and PCA 
is the best. 

In the end, we present two traditional graphs in power 
analysis, where power is plotted against effect size and 
sample size, shown in Figures 4 and 5. As expected, 
power improves with increases of sample size and effect 
size. PCA has again the best performance, followed by 
the one-by-one and permutation methods, which do not 
really differ, and MANOVA. 

Note that for all simulations above, power is calculated 
as the proportion of ^-values less than a pre-determined 
threshold. In practice, we often fix 0.05 as that threshold 
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sample size = 300; delta = 1.2; norm = TRUE; cali = fiom 0 to 0.2; nsim = 10000 
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Figure 2 Power versus correlation coefficient. Phenotypes are 
generated from multivariate normal distribution. Correlation 
coefficient ranges from 0 to 0.2. 



such that the estimated type I error is controlled at 
about 5% and, in principle, power is expected to be 
about 5% under the null hypothesis, H 0 : 8 = 1. However, 
we do not know whether it is always the same case for 
the four approaches, especially when correlation coeffi- 
cient and number of phenotypes also vary. In supple- 
mentary method in Additional file 1, we address the 
issue of selecting a proper threshold in detail. 

We also want to check the performance of these four 
methods when phenotype distributions deviate from 
normal. Using the same parameter setting as above, we 
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Figure 4 Power versus effect size. Number of phenotypes m = 10. 



find similar patterns for the four methods (See supple- 
mentary Figures 2 and 3 in Additional file 1). 

Discussion 

We have described four methods to analyze multiple ob- 
served phenotypes for linkage and association studies of 
complex traits. For any given marker, it is likely that a 
simple dichotomized phenotype of affected versus un- 
affected is not clearly associated with the marker due to 
relatively subjective definition and characterization of 
disease status, especially for psychiatric traits. Disease 
definition is not generally based on the genetics of the 
trait but is established for the purpose of unique 
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Figure 3 Power versus number of phenotypes. Phenotypes are 
moderately correlated. 
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Figure 5 Power versus sample size. Number of phenotypes 
m = 10. 
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classification. Thus, a range of phenotypes is required to 
capture all underlying genetic risk factors with different 
functions. Hypertension in Lyon hypertensive rats is a 
good example. It was shown that two different blood 
pressure measurements, diastolic and pulse, are associ- 
ated with different genes which might have been missed 
based on the conventional measurement of systolic and 
diastolic blood pressure [15]. Furthermore, because of 
the correlation structure of many of the phenotypes 
measured, using PCA to combine the attributes linearly 
to then simultaneously analyze all phenotypes may be 
more informative than a straightforward univariate or 
multivariate approach. Our results from both real and 
simulated data imply that statistical power and validity 
can be increased through the PCA approach. 

In this study, we investigate the performance of differ- 
ent approaches to analyse multiple continuous pheno- 
types and recommend PCA as the optimal method. 
When phenotypes comprise both discrete and continu- 
ous variables, each discrete phenotype can be non- 
linearly transformed [16] before being included as input 
of the PCA. It is worth studying explicitly the perform- 
ance of different approaches on both continuous and 
categorical data in future work. In addition, PCA pro- 
vides several components. In Additional File 1, we fur- 
ther discuss whether it is better to include more PCs as 
outcome variables. Preliminary results show that PCA 
with the first two PCs does not achieve the same power 
as PCA with the first PC, but still performs among the 
best especially when correlation between phenotypes is 
relatively low (see supplementary Figure 4 in Additional 
file 1). 

When phenotype distributions deviate from normality, 
we check the performance of the four methods and 
again conclude that PCA is optimal. Although a trans- 
formation of the non-normality, e.g. using Box-Cox or 
simply log transformations, would improve power to be 
similar to that for normally distributed variables, we 
show through simulation that the consequence would be 
decrease in power if we did not remove non-normality 
in our phenotype distribution. Comparing Figure 1 and 
Supplementary Figure 2, PCA appears to be less sensi- 
tive to the assumption of normality. 

In practice it is not uncommon to have missing values, 
which may differently affect the four methods. The uni- 
variate approaches may be less affected in the sense that 
if a missing value exists in one of the phenotypes, we are 
still able to select a minimal p-value among the other 
phenotypes. But the standard PCA and MANOVA re- 
quire that all values be present for the phenotypes. For- 
tunately, there are methods available to get around the 
problem of missing data, for example imputation. So we 
can predict and fill in missing values before implementing 
any method to analyse the multi-phenotype data. 



Additionally, an R package, pcaMethods, allows perfor- 
ming PCA on incomplete data and may be used for miss- 
ing values estimation [17]. 

Working with the different methods brought several 
issues to our attention. We found that using Bonferroni 
correction and permuted p-value performed comparably 
in terms of power in the two univariate approaches. At 
first glance, this result seems surprising, given that the 
Bonferroni correction is known to be conservative when 
phenotypes are strongly correlated. However, it is worth 
noting that we use a calibrated threshold throughout 
such that power is 5% under the null hypothesis and, 
thus, we have a fair comparison in terms of power. As 
shown in the left panel of Supplementary Figure 1, had 
we applied a fixed threshold 0.05, the one-by-one ap- 
proach using Bonferroni correction would be associated 
with less power than the permuted p-value approach. 
However, this perceived power difference is not real as it 
was due to different type 1 error rates when the null hy- 
pothesis is true. When considering the burden of com- 
putation time in permutation testing and its relatively 
poor performance compared with PCA in most of the 
model settings, we conclude that neither of the two uni- 
variate methods performs optimally. 

Conclusions 

Using simulation we demonstrated that PCA provides 
the greatest power when applied to both correlated phe- 
notypes and large numbers of phenotypes. The multi- 
variate approach had low type I error only with 
independent phenotypes or small numbers of pheno- 
types. Despite increasing awareness of how to deal with 
multiple phenotypes, the one-by-one approach is still 
commonly employed by researchers. Examples of using 
the other methods in real data application are not often 
seen. In this study, our application of the four methods 
to schizophrenia data provides converging evidence of 
the relative performance of the methods. We propose 
that our comparison will provide some insight into the 
properties of the methods and help researchers to 
choose appropriate strategy in future experimental 
studies. 

Additional file 



Additional file 1: Supplementary material and method. A PDF 

containing description in detail of schizophrenia data used for testing the 
four methods, methodology addressing the issue of selecting a proper 
threshold to declare significance, and results of the performance of the 
four methods when phenotype distributions deviate from normal. 
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